Biological Pattern Discovery with R Machine Learning Approaches (Zheng Rong Yang)


Next-generation sequencing data quality

Next-generation sequencing technology has been very widely used in biological/medical research. For instance, it has been combined with transposon technology to generate a new method for mutation analysis, called transposon sequencing technology. Using this technology, millions of mutants can be identified in a single experiment. Based on the gene-wise transposon statistics, essential genes can be identified using the density pattern analysis approaches discussed in Chapter 2 of this book. However, most existing approaches for identifying essential genes are based on replicate-free data. The assumption is that a single transposon sequencing data set is noise-free or that its noise level can be well controlled. However, it has been recognised that transposon insertions are random events [Golden, et al., 2000; ...h, et al., 2014; Baym, et al., 2016]. In other words, in replicated transposon sequencing data, it is very unlikely that there will be identical transposon insertion distributions across replicates. An insertion site in one replicate may not be present in other replicates in the same experiment. Even when it is present across replicates, the transposon is less likely to be inserted at the exact same base pair across replicates. It has been assumed that a well-designed study will ensure that all transposon insertion sites in a target genome are covered in replicated data. In other words, it assumes that the replicate number is the minimum set to cover all transposon insertion sites. In addition, the insertion frequency is a random event and will differ between replicates at each insertion site. It is therefore a reasonable assumption that the unobserved but true transposon insertion frequency can be discovered if replicated data are used. Based on this assumption, a novel and more efficient approach may need to be developed to deal with sequencing data noise for better essential gene identification. This kind of thinking can also be used in other areas dealing with sequencing data, such as sequence assembly, where it has been exercised for replicate-free data. In fact, the de novo